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ABSTRACT 

The DNA Data Bank of Japan (DDBJ; http://www. 
ddbj.nig.ac.jp) maintains and provides archival, re- 
trieval and analytical resources for biological infor- 
mation. This database content is shared with the US 
National Center for Biotechnology Information 
(NCBI) and the European Bioinformatics Institute 
(EBI) within the framework of the International 
Nucleotide Sequence Database Collaboration 
(INSDC). DDBJ launched a new nucleotide 
sequence submission system for receiving trad- 
itional nucleotide sequence. We expect that the 
new submission system will be useful for many sub- 
mitters to input accurate annotation and reduce the 
time needed for data input. In addition, DDBJ has 
started a new service, the Japanese Genotype- 
phenotype Archive (JGA), with our partner institute, 
the National Bioscience Database Center (NBDC). 
JGA permanently archives and shares all types of 
individual human genetic and phenotypic data. We 
also introduce improvements in the DDBJ services 
and databases made during the past year. 

INTRODUCTION 

The DNA Data Bank of Japan (DDBJ, http://www.ddbj. 
nig.ac.jp) (1) is one of the three databanks that comprise 
the DDBJ/ENA/GenBank International Nucleotide 
Sequence Database (INSD) (2-4), which is a close collab- 
oration between DDBJ, the European Bioinformatics 
Institute (EBI) in Europe and the National Center for 
Biotechnology Information (NCBI) in the USA. Like 
ENA and GenBank, DDBJ provides biological databases 
and analytical services to researchers to support biological 
research. 



We have already reported that our previous supercom- 
puter system has been totally replaced by a new commod- 
ity-cluster-based system (1). Our services including Web 
services, submission system, BLAST, CLUSTALW, 
WebAPI and databases are now running on a new super- 
computer system. In addition to a traditional assembled 
sequence archive, DDBJ provides the DDBJ Sequence 
Read Archive (DRA) (5) and the DDBJ BioProject (6) 
for receiving short reads from next-generation sequencing 
(NGS) machines and organizing the corresponding data 
obtained from research projects. In 2013, DDBJ has 
started the BioSample database in collaboration with 
INSDC (7,8) to organize sample information with the 
DRA and the BioProject. In recent years, DDBJ has 
devoted energy for constructing databases such as DRA 
and BioProject to prepare for receiving huge numbers of 
sequences from next-generation sequencers and organize 
genome projects. Because of an increase in databases in a 
short period of time, DDBJ has been concerned that it is 
becoming difficult for submitters to submit nucleotide se- 
quences with accurate information. Although the amount 
of traditional assembled sequence data is smaller than that 
of submissions to the DRA (raw data from NGS), the 
annotations describing each nucleotide sequences are 
used for a reference database that helps researchers in 
genome analysis. DDBJ is aware of the importance of 
allowing the submission system to lead submitters to 
provide accurate annotation. DDBJ had long (since 
1995) used SAKURA (9) to receive traditional nucleotide 
sequences via the Web and has replaced it with a new 
system operating on a new supercomputer. The new 
DDBJ nucleotide sequence submission system incorpor- 
ates ideas that will reduce the time needed for completing 
a submission and help submitters provide accurate 
annotation. 

DDBJ decided to launch Japanese Genotype-pheno- 
type Archive (JGA) in collaboration with the National 
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Bioscience Database Center (NBDC) of the Japan Science 
and Technology Agency. JGA is constructed for the 
purpose of archiving human personal genomic and pheno- 
typic data and providing researchers with secure access. 

All resources described here are available from http:// 
www.ddbj.nig.ac.jp. 



DDBJ ARCHIVAL DATABASES IN 2013 

DDBJ traditional assembled sequence archive 

Between July 2012 and June 2013, the DDBJ periodical 
release increased by 1 1 799 452 entries and 1 1 686 547 887 
base pairs. This periodical release, also known as the 
INSD core traditional nucleotide flat files, does not 
include whole-genome shotgun (WGS) and third party 
data (TP A) files (10). The DDBJ contributed 17.8% of 
the entries and 12.2% of the total base pairs added to 
the core nucleotide data of INSD. Most of the nucleotide 
data records provided to the DDBJ (97.83%) were 
submitted by Japanese researchers and the rest coming 
from Korea (1.26%), China (0.54%), Colombia (0.17%), 
Taiwan (0.04%), USA (0.04%) and other countries and 
regions (0.11%). DDBJ has continuously distributed 
sequence data in published patent applications from the 
Japan Patent Office (JPO, http://www.jpo.go.jp) and the 
Korean Intellectual Property Office (KIPO, http://www. 
kipo.go.kr/en). JPO transferred its data to DDBJ directly, 
whereas KIPO transferred its data via an arrangement 
with the Korean Bioinformation Center (KOBIC). A 
detailed statistical breakdown of the number of records 
is shown on the DDBJ homepage (http://www.ddbj.nig. 
ac.jp/breakdown_stats/prop_ent.html). In addition to the 
core nucleotide data, DDBJ has released a total of 
5 099 547 WGS entries, 721 TP A entries, 6374 TPA- 
WGS entries and 1272 TPA-CON entries as of 27 
August 2013. 

Noteworthy large-scale data released from DDBJ are 
listed in Table 1. Regarding the genome of Theileria 
orientalis strain Shintoku, which was submitted by 
Hokkaido University, a DDBJ annotator participated in 
the annotation procedure. Moreover, DDBJ has released 
the following: coelacanth (Latimeria chalumnae) genome 
submitted by Tokyo Institute of Technology; diamond- 
back moth (Plutella xylostella) draft genome submitted 
by the National Institute of Agrobiological Sciences; 
Jatropha curcas genome submitted by the Kazusa DNA 
Research Institute; mouse (Mus musculus) strain MSM/ 
Ms draft genome submitted by the National Institute of 
Genetics (NIG); two sets of genome survey sequences 
(GSS) of African clawed frog (Xenopus laevis) submitted 
by the NIG; GSS of coelacanth (L. chalumnae) submitted 
by the NIG; transcriptome shotgun assemblies (TSA) of 
great pond snail (Lymnaea stagnalis) submitted by 
Tokushima Bunri University; TSA of a forest soil 
metagenome submitted by the National Institute of 
Advanced Industrial Science and Technology; expressed 
sequence tags (EST) of a species of planarians (Dugesia 
japonica) submitted by RIKEN; EST of white-tufted-ear 
marmoset (Callithrix jacchus) submitted by the NIG. 



Sequence read archive, BioProject and BioSampIe 
databases 

As an INSDC activity, the DDBJ decided to launch the 
BioSampIe database to organize sample information 
across archival databases. The study and sample objects 
of the Sequence Read Archive will be migrated to the 
BioProject and BioSampIe records, respectively. 

Japanese Genotype-phenotype Archive 

DDBJ has started a new service, the Japanese Genotype- 
phenotype Archive (JGA, http://trace.ddbj.nig.ac.jp/jga) 
with our partner institute, the NBDC. JGA is a service 
for permanent archiving and sharing of all types of 
personal genetic and phenotypic data resulting from bio- 
medical research projects. JGA contains exclusive data 
collected from individuals whose consent agreements au- 
thorize data release only for specific research use. JGA 
provides controlled access to individual data as the 
database of Genotypes and Phenotypes (dbGaP) at 
NCBI (11) and the European Genome-phenome Archive 
(EGA) at EBI (12). 

JGA accepts only de-identified data with a NBDC 
approved access plan. Users directly apply for data sub- 
mission to NBDC, and JGA will accept and process 
submissions only when notification of a successful appli- 
cation process has been passed from NBDC to JGA. The 
accepted data types include manufacturer-specific raw 
data formats from array based and new sequencing plat- 
forms. The processed data, including genotype and struc- 
tural variants or any summary statistical analyses, are 
stored in databases. JGA also accepts and distributes 
any phenotype data associated with the samples. 

JGA implements access-granting policy, whereby deci- 
sions on granting access to the data reside with NBDC. 
Users must be approved by NBDC to access the JGA 
data. 



DDBJ SYSTEMS PROGRESS 

A new submission system: D-easy 

DDBJ has launched a new Web-based nucleotide 
sequence submission system (code name: D-easy) for 
receiving traditional submissions after 3 October 2012. A 
former Web-based submission system, SAKURA, was 
terminated on 31 October 2012. SAKURA was originally 
created for the submission of small-scale data in 1995 and 
used for a long period. In recent years, we suspected that 
SAKURA was not suitable for the current volume of 
nucleotide data registration, given the large number of 
nucleotide sequences that can now be acquired rapidly 
and cheaply. Accordingly, DDBJ decided to develop a 
new nucleotide data submission system that works on a 
Web server. The new submission system features an 
improved sequence and annotation input system to 
accept larger numbers of nucleotide sequence data at 
once. We expect that submitters can shorten the time 
required for nucleotide data submission by using the 
new nucleotide data submission tool. 
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Table 1. List of large-scale data released by DDBJ from July 2012 to June 2013 



Type 


Ortra niQTn 

V / 1 eiclllljlll 


ApppSQinn niirnhpr ftiiiTnhpr nf pntrips^ 


Genome 


Theileria orientalis strain Shintoku 


Chromosomes: AP011946-AP011949 (4 entries) 

Anirrmlnct- AP01 1QS0 
/A.piL-Upid.3 1. nrUl 17JU 

IVIILOLIIUIIUI IOII. rtrUl l7Jl 




i j^fM^Puntn i / /7 f Mii/>f* t /'i /*ri flit iwivtfiGi 

V^UClcH-ClllLll \L^LllUllv.l III LI lUlUt 1 If tUC ) 


SraffnlH CON- DF1 ^RQOfi-DFI 96766 H7X61 entries 
WOS- RAHO01000001-RAHO01475474 ^475474 pntrip«l 




Diamondback moth {Plutella xylostella) 


BAGR01000001-BAGR01088530 {88 530 entries) 




J citi'ophci curcQ-S 


WGS: BABX02000001-BABX02066610 (66 610 entries) 




Mouse (Mus musculus MSM/Ms) 


WGS: BAAG010000001-BAAG01 1237600 (1237 600 entries) 


GSS 


African clawed frog (Xenopus laevis) 


GA131508-GA388245 (251 621 entries; some missing) 
GA720358-GA867435 (147078 entries) 




Coelacanth (L. chalumnae) 


GA605430-GA720357 (114 928 entries) 


TSA 


Great pond snail (Lymnaea stagnalis) 


FX180119-FX296473 (116 355 entries) 




Forest soil metagenome 


FX000001-FX056084 (56 084 entries) 


EST 


Planarian (Dugesia japonica) 


5'-EST: FY925127-FY960824 (35 698 entries) 
3'-EST: FY960825-FY979285 (18461 entries) 




White-tufted-ear marmoset (Callithrix jacchus) 


HX373156-HX663542 (290387 entries) 



We identified problematic issues in using SAKURA, 
which should be improved in a new submission tool. In 
particular, submission of multiple nucleotide sequences 
was very time-consuming because nucleotide sequences 
were required to be submitted one by one; multiple- 
FASTA files were not accepted. It was difficult for a sub- 
mitter to identify the feature key(s) best matching the 
annotation of a nucleotide sequence, leading to frequent 
failure of the submitter to input essential values. 
Sometimes we could not contact a submitter because 
of a mistyped e-mail address. In view of these problems 
of SAKURA, we identified ideas that would be helpful for 
inputting annotation and nucleotide sequence: 

- To reduce the time required for inputting annota- 
tion, a spreadsheet input method will be better. 

- For typical annotation, such as 16S rRNA, mRNA 
and D-loop, an annotation template system would be 
useful. Each template has feature and qualifier sets 
that are mandatory for each annotation. For 
example, when the 16S rRNA template is selected, 
source and rRNA feature are automatically chosen. 

Finally, we equipped D-easy with the following 
functions. 

(i) The e-mail address is authenticated during submis- 
sion. A submitter, after inputting contact informa- 
tion, receives an e-mail message and must click on 
a link in the message to input the nucleotide data. 
This function obviates sending a fax to the submit- 
ter to know a correct e-mail address. 

(ii) In general, a submitter uses a non-spreadsheet 
input system, which is a standard annotation 
input form of D-easy (Figure 1A). All feature 
and qualifier keys, which are allowed in DDBJ's 
annotation rule, are selectable and submitter can 
input any type of annotation. Only source feature 
is displayed at the start screen of the input form 
and the submitter must add required feature keys 
to fill the annotation. Mandatory or recommended 



qualifiers are automatically chosen when a feature 
is added. A submitter will use the non-spreadsheet 
type annotation input form when 'No, I cannot 
find my kind of annotation in the list above' is 
chosen on the template selection page. 
If the annotation pattern of entire nucleotide se- 
quences is included in the prepared template, the 
submitter can use a spreadsheet-type annotation 
input system (Figure IB). The sheet looks like a 
table having rows whose number is equal to 
number of nucleotide sequences and columns 
including source feature and a preset feature (e.g., 
source + CDS, source + rRNA). Though, submitter 
cannot add, delete and change the preset feature 
keys, submitter can easily know required feature 
and qualifier keys for the annotation because 
required qualifiers are preselected from the start 
(see (iii) and (iv) below). Useful functions, such 
as copying a value downward or pasting text list 
to multiple entries, are available in a spreadsheet- 
type annotation input system, helping the submitter 
to shorten the time necessary to complete the 
submission. 

(iii) Specialized templates have been prepared for data 
submission, including bacterial 16S rRNA, mRNA, 
influenza A virus gene, D-loop, etc. Each template 
has feature keys that must be used in the 
annotation. 

(iv) Available qualifier keys are automatically chosen 
with each template. For example, if the submitter 
selects a template from 'Sequences from isolated 
bacteria or archaea', qualifiers used in source 
feature are limited to those relevant to bacteria 
or archaea. 

(v) In the system, qualifier keys are classified into three 
ranks: 'mandatory', 'recommended' and 'optional', 
according to the priority of usage. The 'mandatory' 
and 'recommended' qualifiers are automatically 
displayed on the screen to inform the submitter 
what information is required for the annotation. 
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Figure 1. Screen shot of a new nucleotide submission system, D-easy. In general, the submitter uses a non-spreadsheet input method (A). When the 
submitter selects an annotation pattern that is prepared as a template page (e.g., 16S rRNA sequence) spreadsheet input is available (B). 



The 'optional' qualifiers are manually added by the 
submitter if needed, 
(vi) The new submission system can receive multi- 
FASTA format, which means that a user can 
submit multiple nucleotide sequences at once. 

D-easy is fast and requires low effort to prepare a sub- 
mission of multiple nucleotide sequences. It is available 
from http://www.ddbj.nig.ac.jp/sub/websub-e.html. 

Sequence analytical services: WebBlast, ClustalW and 
DRA Pipeline 

The Web BLAST (13) services of DDBJ are hosted on the 
supercomputer system of the NIG (1). The NIG super- 
computer system (whose operation started in March 
2012) is a typical HPC cluster system, consisting of 
general-purpose calculation nodes (thin-node 64 GB 
memory, 352 nodes) and calculation nodes for memory- 
intensive tasks including de novo assembly of NGS data 



(2 medium nodes each with 2 TB of memory and 1 fat 
node with 10 TB of memory). These calculation nodes are 
interconnected with InfiniBand QDR by a complete bisec- 
tion fat-tree topology. In addition, to allow the many cal- 
culation nodes to read and write the same files in parallel, 
the NIG supercomputer is equipped with 2 PB of the 
Lustre parallel distributed file system (http://www. lustre, 
org/) as a high-performance large external storage system. 

Web BLAST service of DDBJ is hosted on several Web 
servers in the NIG supercomputer system and receives 
user requests (query sequences) from a Web input form 
or a file upload form. DDBJ also provides the newer 
version of Web API for Bioinformatics (WABI) (14-16), 
a RESTful Web API service that can process many 
requests sent from computer programs. 

Requests sent both to Web applications and to Web 
APIs are funneled to a resource scheduler, the Univa 
Grid Engine (UGE, http://www.univa.com/products/ 
grid-engine.php), on the NIG supercomputer system. 
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UGE manages the requested jobs in the work queue and 
allocates calculation nodes to each job. This configuration 
drastically decreases the workload of the Web servers so 
that they can accept as many as 1000 requests per second. 
This ability is important for large-scale data analysis using 
Web APIs. At present, 28 thin nodes (448 cores) of the 
NIG supercomputer are assigned for the Web BLAST and 
WABI services. By the beginning of March 2014, the NIG 
supercomputer system will be enhanced to accommodate 
the increasing demands of massive data analysis. This 
enhanced system will be equipped with 7 PB of external 
storage in the Lustre system and ~550 general-purpose 
calculation nodes. Along with enhancement, the perform- 
ance of the Web applications and Web APIs will be 
improved to a considerable extent. 

Other Web applications and RESTful Web APIs 
working in similar ways are planned for release. The 
VecScreen system (http://www.ncbi.nlm.nih.gov/tools/ 
vecscreen/univec/), ClustalW (17,18), MAFFT (19,20) 
and a keyword search system for DDBJ flat files will be 
released in the near future. Our current keyword search 
system, ARSA (1), searches against only a single database, 
traditional DDBJ flat files. It is expected that target 
database of search system will be expanded to all DDBJ 
databases because API-based search indexing is also ap- 
plicable to other databases, such as DRA, BioProject, 
BioSample, etc. 

The DDBJ Read Annotation Pipeline (DDBJ Pipeline, 
http://p. ddbj.nig.ac.jp/) is a high-throughput Web anno- 
tation system of NGS reads running on the NIG super- 
computer (21). The DDBJ Pipeline offers a user-friendly 
graphical Web interface and processes massive NGS 
datasets using decentralized processing by NIG supercom- 
puters, which is currently free of charge. The pipeline 
consists of two analysis components: basic analysis for 
reference genome mapping and de novo assembly and sub- 
sequent high-level analysis of structural and functional 
annotations. The high-level analysis employs a non- 
original Galaxy interface (22), given that the diverse 
workflows require flexible connections with respective 
analytical programs. Public NGS reads from the DRA 
located on the same supercomputer can be imported 
into the pipeline via input of only an accession number. 
In the 2013 update, the functional enhancement of DDBJ 
Pipeline covers new three workflows: de novo transcrip- 
tome annotation using trinity (23), DNA polymorphism 
detection with multiple strains, and human HLA haplo- 
type detection (24). As an extension of the DDBJ Pipeline, 
we are now preparing to develop a virtual machine image. 



FUTURE DIRECTION 

In this report, we have introduced the 2013 update of 
DDBJ including a thorough renovation of the submission 
system D-easy for the INSDC conventional nucleotide 
database. At present, D-easy works independently and is 
not connected to other DDBJ database services. We are 
planning to connect D-easy under the D-way system to 
enable users to submit their comprehensive data to 
relevant DDBJ databases such as the traditional DDBJ, 



DRA, BioProject and BioSample. We are also improving 
the organism input and template systems to increase the 
variety of templates involving many annotation types. 

DDBJ is promoting the integration of databases using 
Semantic Web technology in cooperation with the 
Database Center for Life Science (DBCLS, http://dbcls. 
rois.ac.jp/en/) and NBDC. We are developing the INSDC/ 
DDBJ ontology as a standardized, systematic description 
of a DDBJ sequence entry, including information about 
submitters, references, source organisms and biological 
feature annotations. A Web Ontology Language for 
INSDC Feature Table Definition, which is a common an- 
notation document revised by INSDC once a year, has 
been generated (25). In the DDBJ nucleotide submission 
system, we have planned for the use of SPARQL querying 
of the INSDC/DDBJ ontology as the submission system 
configuration instead of just updating the submission 
system each year. In the near future, we will support the 
RDFization of DDBJ nucleotide sequence entry and a 
public SPARQL endpoint toward the integration of 
DDBJ resources including the submission system and 
public release of data. 



ADDITIONAL INFORMATION 

More information is available on the DDBJ website at 
http://www.ddbj.nig.ac.jp. News is delivered by really 
simple syndication (RSS), Twitter and mail magazines. 
Instructional videos of PowerPoint slides for explaining 
the DDBJ submission system are also available through 
'DDBJ YouTube Channel' on YouTube. 
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